Design and Impact of Automated Decision Systems – My Machine Learning Projects Blogs

Abstract and Methodologies

Automated decision systems are increasingly used in financial institutions to assess credit risk and determine loan eligibility. In this blog post, we build upon the theoretical framework of binary decision-making with a linear score function, applying it to a more realistic credit-risk prediction scenario. Our goal is twofold: first, to develop a score function and threshold that optimize a bank’s total expected profit while considering various borrower features; and second, to assess how this decision system impacts different demographic segments. By leveraging data-driven modeling, visualization, and profit-based optimization, we aim to create a more informed and equitable approach to automated lending decisions.

Loading the data

import pandas as pd
url = "https://raw.githubusercontent.com/PhilChodrow/ml-notes/main/data/credit-risk/train.csv"
df_train = pd.read_csv(url)

df_train.head()

	person_age	person_income	person_home_ownership	person_emp_length	loan_intent	loan_grade	loan_amnt	loan_int_rate	loan_status	loan_percent_income	cb_person_default_on_file	cb_person_cred_hist_length
0	25	43200	RENT	NaN	VENTURE	B	1200	9.91	0	0.03	N	4
1	27	98000	RENT	3.0	EDUCATION	C	11750	13.47	0	0.12	Y	6
2	22	36996	RENT	5.0	EDUCATION	A	10000	7.51	0	0.27	N	4
3	24	26000	RENT	2.0	MEDICAL	C	1325	12.87	1	0.05	N	4
4	29	53004	MORTGAGE	2.0	HOMEIMPROVEMENT	A	15000	9.63	0	0.28	N	10

Analysis of Loan Intent by Age Group

The bar chart below shows how different age groups use their loans for various purposes, such as venture, education, medical, home improvement, personal, and debt consolidation. We can see that borrowers aged 18–29 account for a large share of all loans, often driven by education and personal loan needs. As age increases, the number of loans generally decreases, but certain categories, like debt consolidation, can become relatively more common in older groups.

Overall, this chart suggests that younger borrowers take out far more loans and may be more focused on educational or personal financing, while older borrowers might shift their attention to consolidating debt or improving their homes. Understanding these patterns helps us see which kinds of borrowers are more or less likely to default on a loan.

import matplotlib.pyplot as plt
import seaborn as sns

# Load the data
url = "https://raw.githubusercontent.com/PhilChodrow/ml-notes/main/data/credit-risk/train.csv"
df = pd.read_csv(url)

# Create additional features for analysis:
# 1. Age groups: we create bins to see how loan intent varies with age.
#    right=False makes the bins [18, 30), [30, 40), ... so they match the labels.
age_bins = [18, 30, 40, 50, 60, 100]
age_labels = ['18-29', '30-39', '40-49', '50-59', '60+']
df['age_group'] = pd.cut(df['person_age'], bins=age_bins, labels=age_labels, right=False)

# ---------------------------
# Visualization 1:
# How does loan intent vary with age group?
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='age_group', hue='loan_intent')
plt.title('Loan Intent Distribution by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Count')
plt.legend(title='Loan Intent')
plt.tight_layout()
plt.show()
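Because the 18–29 group is so much larger than the others, raw counts make it hard to compare the mix of loan purposes across ages. One way to check the "relatively more common" claim is to look at shares within each age group instead. The sketch below uses a small made-up table (the rows are illustrative, not taken from the dataset) and assumes left-closed bins so that a 30-year-old falls in the 30-39 group.

```python
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "person_age": [22, 25, 27, 33, 35, 41, 45, 55, 62],
    "loan_intent": ["EDUCATION", "EDUCATION", "PERSONAL", "MEDICAL",
                    "DEBTCONSOLIDATION", "DEBTCONSOLIDATION",
                    "HOMEIMPROVEMENT", "PERSONAL", "MEDICAL"],
})

# right=False gives bins [18, 30), [30, 40), ... matching the labels
toy["age_group"] = pd.cut(
    toy["person_age"],
    bins=[18, 30, 40, 50, 60, 100],
    labels=["18-29", "30-39", "40-49", "50-59", "60+"],
    right=False,
)

# normalize="index" turns each row (age group) into proportions that sum to 1
intent_share = pd.crosstab(toy["age_group"], toy["loan_intent"], normalize="index")
print(intent_share.round(2))
```

On the real data, the same `pd.crosstab` call on `df` would show whether debt consolidation really takes a larger share among older borrowers, or whether the effect is only visible in raw counts.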

Analysis of Average Loan Amount by Credit History Length

This bar chart shows how the average loan amount changes based on the number of years a borrower has had a credit history. In general, we see that some longer lengths of credit history are associated with higher average loan amounts than others, though the pattern isn’t strictly increasing or decreasing. This suggests that lenders may be willing to extend larger lines of credit to individuals with certain credit history profiles.

For our automated decision system, credit history length could be an important feature because it often reflects a borrower’s past experience with credit and repayment behavior. However, we must be mindful of fairness and potential biases: borrowers who are younger or newer to credit might be at a disadvantage if the model heavily weighs credit history length. Balancing profitability for the bank with equitable access to credit remains a key challenge in designing our scoring and thresholding methods.


# 2. Employment length groups: useful for exploring patterns with job experience.
#    right=False makes the bins [0, 2), [2, 5), [5, 10), [10, inf) so they match the labels.
emp_bins = [0, 2, 5, 10, float('inf')]
emp_labels = ['0-1 yrs', '2-4 yrs', '5-9 yrs', '10+ yrs']
df['emp_length_group'] = pd.cut(df['person_emp_length'], bins=emp_bins, labels=emp_labels, right=False)

# ---------------------------
# Visualization 2:
# How does the average loan amount vary with credit history length?
plt.figure(figsize=(10, 6))
avg_loan = df.groupby('cb_person_cred_hist_length')['loan_amnt'].mean().reset_index()
sns.barplot(data=avg_loan, x='cb_person_cred_hist_length', y='loan_amnt')
plt.title('Average Loan Amount by Credit History Length')
plt.xlabel('Credit History Length (Years)')
plt.ylabel('Average Loan Amount')
plt.tight_layout()
plt.show()
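Since the pattern is not strictly increasing or decreasing, it can help to look at how many borrowers sit behind each bar: an average built on a handful of borrowers can swing a lot. The sketch below, on made-up numbers, reports the group size next to the mean. The `small_sample` cutoff is an arbitrary assumption for illustration.

```python
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "cb_person_cred_hist_length": [2, 2, 2, 3, 3, 10, 28],
    "loan_amnt": [5000, 7000, 6000, 8000, 10000, 12000, 25000],
})

by_hist = (
    toy.groupby("cb_person_cred_hist_length")["loan_amnt"]
    .agg(avg_loan="mean", n="count")   # mean and group size side by side
    .reset_index()
)

# Flag averages that rest on very few borrowers (cutoff chosen arbitrarily here)
by_hist["small_sample"] = by_hist["n"] < 2
print(by_hist)
```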

Analysis of Loan Amount vs. Loan % of Income by Home Ownership

In our scatter plot below, each dot represents a borrower, with the x-axis showing how large the loan is relative to their income (as a percentage) and the y-axis showing the absolute loan amount. The colors indicate different types of home ownership (RENT, MORTGAGE, OWN, OTHER).

As you can see, the data points overlap heavily, making the chart look cluttered. To get a clearer picture, I will split these data into separate graphs for each home ownership category. This will help us see more nuanced patterns, like whether renters tend to have higher loan-to-income ratios compared to those who own or have a mortgage.

df['loan_percent_income_pct'] = df['loan_percent_income'] * 100
plt.figure(figsize=(10, 6))
sns.scatterplot(
    data=df, 
    x='loan_percent_income_pct', 
    y='loan_amnt', 
    hue='person_home_ownership',
    alpha=0.7
)
plt.title('Loan Amount vs. Loan % of Income by Home Ownership')
plt.xlabel('Loan as % of Person Income')
plt.ylabel('Loan Amount')
plt.xlim(0, 100)  # Focus on 0–100% if most loans fall in this range
plt.tight_layout()
plt.show()

Analysis of Loan Amount vs. Loan % of Income by Home Ownership (Separated Plots)

By splitting the data into four subplots (one for each home ownership category), we can see that:

Renters typically have loan amounts averaging around $8,400–$10,100 depending on loan intent, with a wide spread of loan-to-income ratios (averaging around 15–20%). This group also shows the densest clustering, which means that renters make up the largest share of borrowers in our dataset.
Mortgage holders often take out larger loans (averaging roughly $10,500–$10,800) but may have lower loan-to-income ratios (closer to 10% on average).
Owners (those who fully own their homes) tend to borrow moderate amounts (roughly $7,700–$9,400) at ratios of around 12–15%.
Others (a much less common category) show a broad mix, but with so few borrowers it is hard to generalize about them.

These distinctions are important for our automated decision system, since each home ownership group presents a different risk and borrowing profile. When designing a score function and threshold to maximize the bank’s profit, it’s important for us to consider whether certain groups (like renters) might be unfairly penalized if they tend to have higher loan-to-income ratios. Ultimately, these separate plots help us fine-tune our model so that we balance profitability with equitable access to credit across different segments of borrowers.



# Create a FacetGrid: one subplot per home ownership category
g = sns.FacetGrid(df, col="person_home_ownership", col_wrap=2, height=4)
g.map(sns.scatterplot, "loan_percent_income_pct", "loan_amnt", alpha=0.7)

# Set the x-axis limits and labels for clarity
g.set(xlim=(0, 100))
g.set_axis_labels("Loan as % of Income", "Loan Amount")
plt.tight_layout()
plt.show()
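Scatter plots make it hard to read off typical values, so a per-group summary is a useful companion to the facets. The sketch below, on made-up rows, computes the group size and medians for each home ownership category. Medians are used here as an assumption, because a few very large loans can pull the mean around.

```python
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "person_home_ownership": ["RENT", "RENT", "RENT", "MORTGAGE", "MORTGAGE", "OWN"],
    "loan_amnt": [8000, 10000, 9000, 12000, 10000, 9000],
    "loan_percent_income": [0.20, 0.25, 0.15, 0.10, 0.08, 0.12],
})

ownership_summary = toy.groupby("person_home_ownership").agg(
    n=("loan_amnt", "size"),                              # borrowers per group
    median_loan=("loan_amnt", "median"),                  # typical loan size
    median_pct_income=("loan_percent_income", "median"),  # typical loan-to-income ratio
)
print(ownership_summary)
```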

Summary of Loan Intent and Home Ownership Segments

This table shows how different combinations of loan intent (e.g., EDUCATION, MEDICAL, PERSONAL) and home ownership (MORTGAGE, OWN, RENT, OTHER) compare in terms of average interest rate, average loan amount, and count of borrowers. We’ve sorted the table by average loan amount in descending order to identify which segments receive the largest lines of credit.

Highest Averages: Segments like PERSONAL–OTHER and MEDICAL–OTHER appear near the top, suggesting they receive higher loan amounts (over $12,000 on average), but also tend to have higher interest rates (roughly 11.7–12.7%). These segments contain only 11–20 borrowers each, so their averages should be read with caution.
Mortgage vs. Rent: The MORTGAGE segments (e.g., DEBTCONSOLIDATION–MORTGAGE, EDUCATION–MORTGAGE) cluster together, with average loan amounts around $10,500–$10,800 and interest rates near 10.4–10.6%. Renters often see slightly higher interest rates (11.3–11.8%) and somewhat lower loan amounts (around $8,400–$8,900), with HOMEIMPROVEMENT–RENT as the exception at about $10,100.
Low Counts: Some segments have very few borrowers (the OTHER segments have 11–20 each, and DEBTCONSOLIDATION–OWN has only 62), so they may not be reliable for broad conclusions.

From the perspective of building an automated decision system, these patterns hint at where the bank’s profit opportunities and risks might lie. For instance, segments with higher average loan amounts but also higher interest rates could be more profitable—but might also carry greater default risk. Tracking how many borrowers fall into each segment (the “count” column) helps ensure the model doesn’t overly focus on small, potentially unrepresentative groups.

summary_table = (
    df
    .groupby(['loan_intent', 'person_home_ownership'], as_index=False)
    .agg(
        avg_interest_rate=('loan_int_rate', 'mean'),
        avg_loan_amount=('loan_amnt', 'mean'),
        count=('loan_amnt', 'count')  # how many borrowers in each segment
    )
)

# Sort by average loan amount (descending) to see which segments get the largest lines of credit
summary_table_sorted_by_amount = summary_table.sort_values('avg_loan_amount', ascending=False)

summary_table_sorted_by_amount

	loan_intent	person_home_ownership	avg_interest_rate	avg_loan_amount	count
17	PERSONAL	OTHER	11.675714	12366.666667	15
13	MEDICAL	OTHER	12.745000	12200.000000	13
5	EDUCATION	OTHER	12.400833	12142.857143	14
9	HOMEIMPROVEMENT	OTHER	11.683000	10959.090909	11
8	HOMEIMPROVEMENT	MORTGAGE	10.613916	10764.017341	1384
20	VENTURE	MORTGAGE	10.468000	10606.281060	1811
0	DEBTCONSOLIDATION	MORTGAGE	10.400489	10588.756111	1841
4	EDUCATION	MORTGAGE	10.554563	10502.178076	2089
12	MEDICAL	MORTGAGE	10.505553	10485.867052	1730
16	PERSONAL	MORTGAGE	10.426285	10481.223233	1868
21	VENTURE	OTHER	12.274211	10367.500000	20
11	HOMEIMPROVEMENT	RENT	11.812287	10109.604633	1252
1	DEBTCONSOLIDATION	OTHER	11.566667	9783.333333	15
14	MEDICAL	OWN	10.749097	9367.755682	352
10	HOMEIMPROVEMENT	OWN	10.922405	9242.450980	255
6	EDUCATION	OWN	10.797507	8996.177184	412
18	PERSONAL	OWN	10.867262	8944.209040	354
3	DEBTCONSOLIDATION	RENT	11.358223	8882.754425	2260
19	PERSONAL	RENT	11.535415	8826.900046	2171
23	VENTURE	RENT	11.438455	8804.566745	2135
22	VENTURE	OWN	10.560000	8789.621914	648
7	EDUCATION	RENT	11.315216	8685.308193	2612
15	MEDICAL	RENT	11.422580	8426.925182	2740
2	DEBTCONSOLIDATION	OWN	14.432909	7749.193548	62
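The table reports interest rates and loan amounts, but not how often each segment actually defaults. Because `loan_status` is 0 or 1, its mean within a segment is the observed default rate, so the same `groupby` pattern can be extended. The sketch below uses made-up rows, and the `min_count` cutoff is a hypothetical choice for filtering out tiny segments.

```python
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "loan_intent": ["MEDICAL"] * 4 + ["EDUCATION"] * 4 + ["VENTURE"],
    "person_home_ownership": ["RENT"] * 4 + ["MORTGAGE"] * 4 + ["OTHER"],
    "loan_status": [1, 0, 1, 0, 0, 0, 0, 1, 1],   # 1 = default
})

segment_risk = (
    toy.groupby(["loan_intent", "person_home_ownership"], as_index=False)
    .agg(
        default_rate=("loan_status", "mean"),  # share of borrowers who defaulted
        count=("loan_status", "count"),        # segment size
    )
)

min_count = 3  # hypothetical cutoff: ignore segments smaller than this
reliable = segment_risk[segment_risk["count"] >= min_count]
print(reliable.sort_values("default_rate", ascending=False))
```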

Training and evaluating our Logistic Regression model

I used a logistic regression model to predict whether a prospective borrower will default on a loan. To preprocess the data, I removed rows with missing values, standardized the numerical features, and one-hot encoded the categorical variables, then split the dataset into training and test sets. The model achieved a test accuracy of about 84.2%.

The confusion matrix provides additional insight into the model’s performance:

True Negatives (TN): 3,393 borrowers who did not default and were correctly predicted as non-default.
False Positives (FP): 221 borrowers who did not default but were incorrectly flagged as defaults.
False Negatives (FN): 501 borrowers who defaulted but were missed by the model.
True Positives (TP): 467 borrowers who defaulted and were correctly identified.

Our results suggest that while the model performs reasonably well overall, there is still a balance to be struck between avoiding false positives and false negatives. This is particularly important when designing an automated decision system for credit risk, because both profitability for the bank and equitable access to credit are critical. Further tuning of the threshold and exploration of additional features could help optimize the model even further for its intended purpose.
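To make the trade-off between false positives and false negatives concrete, the four counts reported above can be turned into the usual summary rates. This is plain arithmetic on those numbers, not a re-run of the model.

```python
# Counts taken from the confusion matrix reported above
tn, fp, fn, tp = 3393, 221, 501, 467

accuracy = (tp + tn) / (tn + fp + fn + tp)   # share of all predictions that are correct
recall = tp / (tp + fn)                      # share of actual defaulters that are caught
precision = tp / (tp + fp)                   # share of flagged borrowers who really default
false_positive_rate = fp / (fp + tn)         # share of good borrowers wrongly flagged

print(f"accuracy={accuracy:.3f} recall={recall:.3f} "
      f"precision={precision:.3f} fpr={false_positive_rate:.3f}")
```

A recall of about 0.48 means that, at the default 0.5 cutoff, the model catches a bit under half of the borrowers who actually default, even though the overall accuracy is above 84%.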

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Feature columns used by the model (defined first, since dropna needs them)
numeric_features = ['person_age', 'person_income', 'person_emp_length', 'loan_amnt', 
                      'loan_int_rate', 'loan_percent_income', 'cb_person_cred_hist_length']
categorical_features = ['person_home_ownership', 'loan_intent', 'cb_person_default_on_file']

# Drop rows with missing values in any feature column
df_train = df_train.dropna(subset=numeric_features + categorical_features)

# Separate the target from the features
target = 'loan_status'
X = df_train.drop(columns=[target])
y = df_train[target]

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(drop='first'), categorical_features)
    ])
X_transformed = preprocessor.fit_transform(X)

# Split the transformed data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_transformed, y, test_size=0.2, random_state=123
)

# Fit a logistic regression model using the preprocessed features
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print("Test Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)

Test Accuracy: 0.8424268878219118
Confusion Matrix:
 [[3393  221]
 [ 501  467]]
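In the code above, the preprocessor is fit on all rows before the split, so the scaler's means and standard deviations also see the test rows. A common alternative is to wrap the preprocessing and the classifier in a `Pipeline` and fit it on the training rows only. The sketch below shows that pattern on randomly generated data (the data and the target rule are made up), so its numbers say nothing about the real model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the credit data (all values are made up)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "loan_percent_income": rng.uniform(0.01, 0.6, n),
    "loan_int_rate": rng.uniform(5, 20, n),
    "person_home_ownership": rng.choice(["RENT", "MORTGAGE", "OWN"], n),
})
# Made-up target: a higher loan-to-income ratio makes default more likely
toy["loan_status"] = (
    toy["loan_percent_income"] + rng.normal(0, 0.1, n) > 0.35
).astype(int)

X_toy = toy.drop(columns=["loan_status"])
y_toy = toy["loan_status"]

# Split the raw rows first, then let the pipeline learn its scaling from X_tr only
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=123)

pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["loan_percent_income", "loan_int_rate"]),
        ("cat", OneHotEncoder(drop="first"), ["person_home_ownership"]),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)

# Probability of default for the held-out rows
test_prob = pipe.predict_proba(X_te)[:, 1]
print("Toy test accuracy:", pipe.score(X_te, y_te))
```

A side benefit of this layout is that `X_tr` stays a DataFrame, so raw columns such as loan amount and interest rate remain available for profit calculations.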

Finding weight and threshold

df['loan_int_rate_decimal'] = df['loan_int_rate'] / 100.0

def profit_if_repaid(loan_amnt, loan_int_rate_decimal):
    """
    If the loan is repaid in full, the bank's profit is:
      loan_amnt * (1 + 0.25*loan_int_rate_decimal)^10 - loan_amnt
    where loan_int_rate_decimal is the rate as a decimal (e.g. 0.10 for 10%).
    """
    return loan_amnt * (1 + 0.25 * loan_int_rate_decimal) ** 10 - loan_amnt

def profit_if_default(loan_amnt, loan_int_rate_decimal):
    """
    If the borrower defaults, we assume default happens 3 years into the loan, 
    and the bank loses 70% of the principal:
      loan_amnt*(1 + 0.25*loan_int_rate_decimal)^3 - 1.7*loan_amnt
    """
    return loan_amnt * (1 + 0.25 * loan_int_rate_decimal) ** 3 - 1.7 * loan_amnt
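As a quick worked example of these two formulas, consider a hypothetical $10,000 loan at a 10% rate, passed as the decimal 0.10. The sketch below redefines the two functions so it runs on its own, and also derives the break-even default probability: lending has positive expected profit when (1 - p) * gain + p * loss > 0, which rearranges to p < gain / (gain - loss).

```python
def profit_if_repaid(loan_amnt, rate):
    # rate is a decimal, e.g. 0.10 for 10%
    return loan_amnt * (1 + 0.25 * rate) ** 10 - loan_amnt

def profit_if_default(loan_amnt, rate):
    return loan_amnt * (1 + 0.25 * rate) ** 3 - 1.7 * loan_amnt

loan, rate = 10_000, 0.10               # hypothetical loan for illustration
gain = profit_if_repaid(loan, rate)     # positive: roughly 2,801
loss = profit_if_default(loan, rate)    # negative: roughly -6,231

# Expected profit (1 - p) * gain + p * loss is positive only below this probability
break_even_p = gain / (gain - loss)

print(f"gain={gain:.2f} loss={loss:.2f} break-even p={break_even_p:.3f}")
```

For this loan the break-even probability is about 0.31, so under these formulas lending only pays off in expectation when the predicted chance of default is below roughly 31%.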

We will now compute predicted probabilities (probability of default) for the training data.

y_prob_train = model.predict_proba(X_train)[:, 1]  # column 1 = probability of default

# Re-split the untransformed features with the same random_state, so the rows
# line up with the transformed split above; the profit formulas need raw values
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)
loan_amnt_array = X_train_raw['loan_amnt'].to_numpy()
# loan_int_rate is stored in percent; convert it to a decimal rate
loan_int_rate_array = (X_train_raw['loan_int_rate'] / 100.0).to_numpy()
y_train_array = y_train.to_numpy()
profit_repaid = profit_if_repaid(loan_amnt_array, loan_int_rate_array)
profit_default = profit_if_default(loan_amnt_array, loan_int_rate_array)

Now we sweep over candidate thresholds and identify the one that maximizes average profit.

# 5. Sweep over thresholds to find the one that maximizes average profit

thresholds = np.linspace(0, 1, 101)
avg_profits = []

for t in thresholds:
    # Predict default if probability >= t
    predicted_default = (y_prob_train >= t).astype(int)
    # If we predict default, we do NOT give the loan => profit = 0
    # If we predict no default, we DO give the loan => actual profit depends on y_train_array
    #    - If actual y=0 (no default), profit = profit_repaid
    #    - If actual y=1 (default), profit = profit_default
    give_loan = 1 - predicted_default  # 1 = give loan, 0 = no loan
    # total_profit[i] = give_loan[i] * [ (1 - y[i])*profit_repaid[i] + y[i]*profit_default[i] ]
    total_profit = give_loan * ((1 - y_train_array) * profit_repaid + y_train_array * profit_default)
    
    # Compute average profit per borrower
    avg_profit = total_profit.mean()
    avg_profits.append(avg_profit)

avg_profits = np.array(avg_profits)

# 6. Identify the best threshold

best_idx = np.argmax(avg_profits) # index of the best threshold
best_threshold = thresholds[best_idx]


print(f"Best Threshold: {best_threshold:.3f}")

Best Threshold: 1.000
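A best threshold of 1.000 means that essentially no borrower is predicted to default, so every loan is approved. That can be a genuine result, but it is also what a sweep like this would return if the profit formulas received the interest rate in percent (e.g. 10) instead of as a decimal (e.g. 0.10), because the default case then looks profitable. The sketch below only illustrates that units effect on a hypothetical loan and is not a diagnosis of the run above.

```python
def profit_if_default(loan_amnt, rate):
    # Same formula as above: default after 3 years, 70% of principal lost
    return loan_amnt * (1 + 0.25 * rate) ** 3 - 1.7 * loan_amnt

loan = 10_000  # hypothetical loan for illustration

as_percent = profit_if_default(loan, 10.0)   # rate mistakenly passed in percent
as_decimal = profit_if_default(loan, 0.10)   # rate passed as a decimal

print(f"default 'profit' with rate=10.0: {as_percent:,.2f}")
print(f"default profit with rate=0.10:  {as_decimal:,.2f}")
```

With the rate in percent, a default appears to earn the bank money, so approving everyone maximizes the average. With the decimal rate, a default is a loss, and the threshold has a real trade-off to balance.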

# 7. Plot profit vs. threshold

plt.figure(figsize=(8, 5))
plt.plot(thresholds, avg_profits, label='Profit per Borrower')
plt.scatter(best_threshold, avg_profits[best_idx], color='red', zorder=10, label='Optimal Threshold')
plt.title('Profit per Borrower (Training Set) vs. Threshold')
plt.xlabel('Threshold (Probability of Default)')
plt.ylabel('Average Profit per Borrower')
plt.legend()
plt.tight_layout()
plt.show()
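The curve above is computed on the training set, so a natural follow-up is to apply the chosen threshold to held-out data and see whether the profit holds up. One way to make that easy is to move the body of the loop into a small function. The helper `average_profit` below is a hypothetical name, and the arrays are made-up values chosen so the result is easy to verify by hand.

```python
import numpy as np

def average_profit(prob_default, y_true, profit_repaid, profit_default, threshold):
    """Average profit per borrower when lending only if prob_default < threshold."""
    give_loan = prob_default < threshold
    # Realized profit of each loan depends on whether the borrower actually defaulted
    realized = np.where(y_true == 1, profit_default, profit_repaid)
    return float(np.mean(give_loan * realized))

# Made-up example: four borrowers with predicted default probabilities
prob = np.array([0.055, 0.205, 0.605, 0.905])
y = np.array([0, 1, 0, 1])                             # 1 = defaulted
gain = np.array([2800.0, 1500.0, 3000.0, 2000.0])      # profit if repaid
loss = np.array([-6200.0, -3300.0, -6600.0, -4400.0])  # profit if default

thresholds = np.linspace(0, 1, 101)
profits = np.array([average_profit(prob, y, gain, loss, t) for t in thresholds])
best_t = thresholds[np.argmax(profits)]

print(f"best threshold={best_t:.2f}, average profit={profits.max():.2f}")
```

The same function can be called once with test-set probabilities and the threshold selected on the training set, which keeps the threshold choice and its evaluation on separate data.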